Skip to content

feat(ssh): wire per-host auth-method learning into discovery/intelligence/liveness#575

Merged
remyluslosius merged 1 commit into
mainfrom
feat/ssh-learning-allpaths
Jun 16, 2026
Merged

feat(ssh): wire per-host auth-method learning into discovery/intelligence/liveness#575
remyluslosius merged 1 commit into
mainfrom
feat/ssh-learning-allpaths

Conversation

@remyluslosius

Copy link
Copy Markdown
Contributor

What

Extends the per-host SSH connection memo from PR #566 (which only learned on the compliance-scan path) into the three remaining paths that talk to a managed host:

  • OS discovery (internal/intelligence/discovery)
  • OS intelligence collection (internal/intelligence/collector)
  • liveness privilege probe (internal/sshprivilege)

Each now leads the SSH dial with the host's last known-good auth method and records which method actually authenticated, via the shared connprofile store.

Why

Without the memo, those paths dialed key-first every cycle. On a password-only host that re-offers an unauthorized public key on a loop (discovery on add, the collector every intelligence cycle, the liveness probe every ~5 min) — a failed publickey attempt that counts against MaxAuthTries and can trip fail2back. This was the user's original "try SSH key next time / try password next time" ask, applied to the paths #566 left out.

How

  • connprofile.WithHostID / HostIDFrom — context helpers so a transport that only receives host:port + cred can still look up + record the host's profile, without changing the SSHTransport.Dial signature (which would churn the discovery + collector interfaces and 8 test stubs). Best-effort enrichment: no host id on the ctx ⇒ no learning, default dial order.
  • discovery.SSHTransportProd.WithProfiles + a dial seam — reads the profile, sets DialOptions.PreferAuth, records ObservedAuth after a successful dial. This one prod transport is what both discovery and the collector dial through (collector via collectorSSHAdapter), so both learn from a single wiring.
  • discovery.go / collector.go wrap the ctx with the host id at the dial site (both hostFacts already carry HostID).
  • sshprivilege.Probe.WithProfiles — leads the dial with the recorded method (reordering buildAuthMethods for AuthBoth; both methods stay offered) and records what authenticated via a small local single-factor observer that mirrors internal/ssh's.
  • cmd/openwatch wires one shared connprofile.NewStore(pool) across all four paths (the scan reuses it now too).

Learning is a hint, not a lock: a missing host id, an absent profile row, or a store error dials in the default order and never fails the connection — a stale hint self-heals on the next dial.

Scope

This PR covers the SSH auth-method dimension. sudo-mode (NOPASSWD vs password) learning for these three paths stays a follow-up — they already probe sudo mode correctly each cycle; only the compliance scan learns both dimensions today. Documented in the spec's excludes.

Spec / tests

  • system-connection-profilev1.1.0: adds C-06 and AC-08 (discovery/collector transport) + AC-09 (liveness probe).
  • New tests: TestSSHTransportProd_AuthLearning (4 cases: lead-with + record, no host id, no store, lookup-error non-fatal) and TestPrivilegeProbe_AuthLearning (lead-with + record / no-store). Existing TestBuildAuthMethods updated to the new signature.
  • gofmt/go vet/go build clean; touched-package suites green against the isolated test DB; specter check 0 errors; system-connection-profile 9/9 ACs have results, PASSES tier 2.

…ence/liveness

The connection-profile memo (PR #566) only led the dial with the host's
known-good SSH auth method on the compliance-scan path. The other three
paths that talk to a managed host -- OS discovery, OS intelligence
(collector), and the liveness privilege probe -- still dialed key-first
every cycle, re-offering an unauthorized public key to password-only
hosts (a failed publickey attempt that counts against MaxAuthTries and
can trip fail2ban) on a loop.

Extend the shared connprofile store into those three paths:

- connprofile.WithHostID / HostIDFrom: context helpers so a transport
  that only receives host:port+cred can still look up + record the
  host's profile, without churning the SSHTransport.Dial signature (and
  its test stubs across discovery + collector).
- discovery.SSHTransportProd gains WithProfiles + a dial seam: when a
  store is wired and the ctx carries a host id, it sets PreferAuth from
  the recorded method and records ObservedAuth after a successful dial.
  This one transport is what BOTH discovery and the collector dial
  through (via collectorSSHAdapter), so both learn at once.
- discovery.go / collector.go wrap the ctx with the host id at the dial
  site (both hostFacts already carry HostID).
- sshprivilege.Probe gains WithProfiles: it leads the dial with the
  recorded method (reordering buildAuthMethods for AuthBoth) and records
  which method authenticated via a local single-factor observer.
- cmd/openwatch wires one shared connprofile.NewStore(pool) across all
  four paths (the scan now reuses it too).

Learning stays best-effort: a missing host id, absent profile row, or
store error dials in the default order and never fails the connection
(hint, not a lock -- a stale hint self-heals on the next dial).

Scope: the SSH auth-method dimension. sudo-mode (NOPASSWD vs password)
learning for these three paths stays a follow-up -- they already probe
sudo mode correctly each cycle; only the scan learns both today.

Spec system-connection-profile -> v1.1.0: C-06, AC-08 (discovery/
collector transport), AC-09 (liveness probe).
@remyluslosius remyluslosius merged commit 0e2096c into main Jun 16, 2026
13 checks passed
@remyluslosius remyluslosius deleted the feat/ssh-learning-allpaths branch June 16, 2026 15:51
remyluslosius added a commit that referenced this pull request Jun 16, 2026
…ce/liveness (#576)

Follow-up to PR #575 (auth-method learning). The same three paths that
talk to a managed host still probed sudo mode from scratch every cycle —
running a doomed `sudo -n` on a password-sudo host before retrying
`sudo -S`. Extend the connprofile memo to the SUDO dimension so each path
leads with the host's recorded mode and records the mode confirmed to
work.

Division of labour:
- The liveness privilege probe is the AUTHORITATIVE sudo-mode learner: it
  runs an innocuous `true` sentinel every ~5 min, so it reliably confirms
  the mode regardless of any real command's exit code.
- OS discovery (firewall probe) and OS intelligence (collector) learn
  OPPORTUNISTICALLY from their existing real sudo commands — no extra
  round-trip. To avoid misrecording, a mode is recorded ONLY on a
  confirmed exit-0 of a given sudo form, never inferred from a command
  that failed for its own reasons.

Mechanics:
- ssh.RunSudo (collector's shared primitive) gains a prefer (in) + observed
  (out): when prefer=SudoPassword and the password gate is satisfied it
  leads with `sudo -S`, skipping the doomed `sudo -n`; observed reports the
  confirmed form. Plain-string tokens (SudoNopasswd/SudoPassword) keep the
  ssh package decoupled from connprofile, matching PreferKey/PreferPassword.
- discovery.runSudoWithFallback + probeFirewall thread prefer + return the
  learned mode; the discovery Service reads it (cfg.prefer) and records it.
- sshprivilege.Probe extracts probeSudo: leads with `sudo -S` on a known
  password host, records the confirmed mode (preserving the AC-18/19/21
  error shapes).
- collector threads the recorded mode across the cycle's sudo commands and
  records once at the end.
- cmd/openwatch wires the shared connprofile store into the collector and
  discovery services (the probe already had it from #575).

The sudo password GATE (kill-switch + auth-method) is unchanged: leading
with `sudo -S` is allowed only when a password may already be fed. Learning
stays best-effort — a store miss/error escalates in the default order and
never fails the connection; a stale mode self-heals (sudo -S miss falls
back to sudo -n).

Spec system-connection-profile -> v1.2.0: C-07, AC-10 (RunSudo primitive),
AC-11 (discovery firewall probe), AC-12 (liveness probe).
remyluslosius added a commit that referenced this pull request Jun 16, 2026
Closes the project's biggest test blind spot: the dial, auth-ordering,
and sudo -n/-S paths were only unit-tested at the command-construction
level (stubbed transport), never against a real box. A wired-up host
could regress and every test stay green.

internal/ssh/livehost_test.go drives the REAL ssh.Dial + ssh.RunSudo —
the primitives every host-talking path (scan, discovery, collector,
liveness) shares — against an operator-supplied inventory:

  OPENWATCH_LIVE_HOSTS=/path/to/test_hosts.csv  (hostname,ip,username,credential)
  OPENWATCH_LIVE_KEY=/path/to/id_rsa

With either unset the test t.Skip()s, so it never gates normal CI; the
inventory + key stay on the operator's workstation, never in the repo.

The fleet is heterogeneous, so the test DISCOVERS each host's
capabilities rather than demanding every method everywhere. Per host it
asserts the machinery for whatever the host supports:

  - key auth dials      -> ObservedAuth == "key"      (the value the memo records)
  - password auth dials -> ObservedAuth == "password"
  - sudo mode confirmed via the `true` sentinel (nopasswd | password)
  - the real `sudo -S -k -p '' true` password-on-stdin path executes

A server-side auth rejection (key not authorized, or PasswordAuthentication
off) is a tolerated host-config fact; an unreachable host is skipped; only
an unexpected protocol-level error or a wrong ObservedAuth/sudo result
fails the test. A host with no usable auth is skipped.

Validated against the dev fleet: 5 key+NOPASSWD hosts pass (real key dial,
sudo -n, and sudo -S all exercised), key-rejecting and unreachable hosts
skip. The password-AUTH assertion is live-unverified only because the dev
fleet runs PasswordAuthentication=no everywhere (noted in BACKLOG); it
runs as soon as one password-enabled host is in the inventory.

Also drops the completed "wire SSH auth/sudo learning" backlog entry
(shipped in #575 + #576).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant